Markov Sources in Single and Multi-Terminal 
Abstract: The optimal causal (zero-delay) coding of a partially 
observed Markov process is studied, where the cost to be min- 
imized is a bounded, non-negative, additive, measurable single- 
letter function of the source and the receiver output. A struc- 
tural result is obtained extending Witsenhausen's and Walrand- 
Varaiya's structural results on optimal causal coders to more 
general state spaces and to a partially observed setting. The 
decentralized (multi-terminal) setup is also considered. For the 
case where the source is an i.i.d. process, it is shown that an 
optimal solution to the decentralized causal coding of corre- 
lated observations problem is memoryless. For Markov sources, 
a counterexample to a natural separation conjecture is presented. 



I. Introduction 

This paper considers optimal causal encoding/quantization of partially observed Markov processes. We begin by describing the system model. We consider a partially observed Markov process, defined on a probability space $(\Omega, \mathcal{F}, P)$ and described by the following discrete-time equations for $t \geq 0$:

$$x_{t+1} = f(x_t, w_t), \tag{1}$$

$$y_t^i = g^i(x_t, r_t^i), \tag{2}$$

for (Borel) measurable functions $f$, $g^i$, $i = 1, 2$, with $\{w_t, r_t^i, i = 1, 2\}$ i.i.d., mutually independent noise processes and $x_0$ a random variable with probability measure $\nu_0$. Here, we let $x_t \in \mathbb{X}$ and $y_t^i \in \mathbb{Y}^i$, where $\mathbb{X}, \mathbb{Y}^i$ are complete, separable, metric spaces (Polish spaces), and thus include countable spaces or $\mathbb{R}^n$, $n \in \mathbb{N}_+$.

Let an encoder, Encoder $i$, be located at one end of an observation channel characterized by (2). The encoders transmit their information to a receiver (see Figure 1) over a discrete noiseless channel with finite capacity; that is, they quantize their information. The information at the encoders may also contain feedback from the receiver, which we clarify in the following.

Let us first define a quantizer. 
Fig. 1: Partially observed source under a decentralized structure.



Definition 1.1: Let $\mathcal{M} = \{1, 2, \ldots, M\}$ with $M = |\mathcal{M}|$. Let $\mathbb{A}$ be a topological space. A quantizer $Q(\mathbb{A}; \mathcal{M})$ is a Borel measurable map from $\mathbb{A}$ to $\mathcal{M}$. $\diamond$

When the spaces $\mathbb{A}$ and $\mathcal{M}$ are clear from context, we will drop the notation and denote the quantizer simply by $Q$.
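As a concrete instance of Definition 1.1, the following minimal sketch takes the space to be the real line and builds a quantizer from a list of bin thresholds. The helper name `make_scalar_quantizer` and the threshold values are illustrative assumptions, not notation from the paper.

```python
import bisect

def make_scalar_quantizer(thresholds):
    """Return a map Q: R -> {1, ..., M}, with M = len(thresholds) + 1 bins.

    Hypothetical helper: the bins are the intervals cut out by the thresholds,
    so Q is piecewise constant and hence Borel measurable.
    """
    cuts = sorted(thresholds)

    def quantize(a):
        # bisect_right counts how many thresholds are <= a; bins are 1-indexed.
        return bisect.bisect_right(cuts, a) + 1

    return quantize

# Illustrative 4-bin quantizer on the real line.
Q = make_scalar_quantizer([-1.0, 0.0, 1.0])
outputs = [Q(a) for a in (-2.0, -0.5, 0.5, 5.0)]
```

Any other measurable partition of the space into at most $M$ cells would serve equally well; interval bins are only the simplest choice.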

We refer by a Composite Quantization Policy $\Pi^{comp,i}$ of Encoder $i$ to a sequence of functions $\{Q_t^{comp,i}, t \geq 0\}$ which are causal, such that the quantization output at time $t$, $q_t^i$, under $\Pi^{comp,i}$ is generated by a causally measurable function of its local information, that is, a mapping measurable on the sigma-algebra generated by

$$I_t^i = \{y^i_{[0,t]}, q^1_{[0,t-1]}, q^2_{[0,t-1]}\}, \quad t \geq 1,$$

and $I_0^i = \{y_0^i\}$, to a finite set $\mathcal{M}_t^i$, the quantization output alphabet at time $t$ given by

$$\mathcal{M}_t^i := \{1, 2, \ldots, |\mathcal{M}_t^i|\},$$

for $0 \leq t \leq T - 1$ and $i = 1, 2$. Here $\{\mathcal{M}_t^i\}$ are fixed in advance and do not depend on the realizations of the random variables. We use the following notation for $t \geq 1$:

$$y^i_{[0,t-1]} = \{y^i_s, \ 0 \leq s \leq t-1\}.$$

Let

$$\mathbb{I}_t^i = (\mathbb{Y}^i)^{t+1} \times \prod_{s=0}^{t-1} (\mathcal{M}_s^1 \times \mathcal{M}_s^2), \quad t \geq 1, \qquad \mathbb{I}_0^i = \mathbb{Y}^i,$$

be information spaces such that for all $t \geq 0$, $I_t^i \in \mathbb{I}_t^i$. Thus, $Q_t^{comp,i} : \mathbb{I}_t^i \to \mathcal{M}_t^i$.
We may express, equivalently, the policy $\Pi^{comp,i}$ as a composition of a Quantization Policy $\Pi^i$ and a Quantizer. A quantization policy of encoder $i$, $T^i$, is a sequence of functions $\{T_t^i\}$, such that for each $t \geq 0$, $T_t^i$ is a mapping from the information space $\mathbb{I}_t^i$ to a space of quantizers $\mathbb{Q}_t^i$. A quantizer is subsequently used to generate the quantizer output. That is, for every $t$ and $i$, $T_t^i(I_t^i) \in \mathbb{Q}_t^i$, and for every $I_t^i \in \mathbb{I}_t^i$ we will adopt the following representation

$$Q_t^{comp,i}(I_t^i) = (T_t^i(I_t^i))(I_t^i), \tag{3}$$

mapping the information space to $\mathcal{M}_t^i$ in its most general form. We note that even though there may seem to be duplicated information in (3) (since a map is used to pick a quantizer, and the quantizer maps the available information to outputs), we will eliminate any informational redundancy: a quantizer action will be generated based on the common information at the encoder and the receiver, and the quantizer will map the relevant private information at the encoder to the quantization output. Such a separation in the design will also allow us to use the machinery of Markov Decision Processes to obtain a structural method to design optimal quantizers, to be clarified further, without any loss in optimality.

That is, let the information at the receiver at time $t \geq 0$ be $I_t^r = \{q^1_{[0,t-1]}, q^2_{[0,t-1]}\}$. The common information, under feedback information, in the encoders and the receiver is the set $I_t^r \in \prod_{s=0}^{t-1} (\mathcal{M}_s^1 \times \mathcal{M}_s^2)$. Thus, we can express the composite quantization policy as:

$$Q_t^{comp,i}(I_t^i) = (T_t^i(I_t^r))(I_t^i \setminus I_t^r), \tag{4}$$

mapping the information space to $\mathcal{M}_t^i$. We note that any composite quantization policy $Q_t^{comp,i}$ can be expressed in the form above; that is, there is no loss in the space of possible such policies, since for any $Q_t^{comp,i}$, one could define

$$T_t^i(I_t^r)(\cdot) := Q_t^{comp,i}(I_t^r, \cdot).$$

Thus, we let DM$^i$ have policy $T^i$ and under this policy generate quantizer actions $\{Q_t^i, t \geq 0\}$, $Q_t^i \in \mathbb{Q}_t^i$ ($Q_t^i$ is the quantizer used at time $t$). Under action $Q_t^i$, and given the local information, the encoder generates $q_t^i$ as the quantization output at time $t$.
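The decomposition of a composite policy into a quantization policy and a quantizer amounts to fixing the common-information argument first. The sketch below illustrates this on a toy rule; the function `composite_policy` and its arithmetic are hypothetical and stand in for an arbitrary composite policy, with `common` playing the role of the past quantization outputs and `private` the role of the encoder's own observations.

```python
from itertools import product

def composite_policy(common, private):
    """Hypothetical composite rule mapping (common, private) info to {1, 2}."""
    return 1 + (sum(common) + private[-1]) % 2

def quantization_policy(common):
    """Map the common information to a quantizer acting on private information."""
    def quantizer(private):
        # The quantizer is the composite rule with the common part frozen.
        return composite_policy(common, private)
    return quantizer

def decomposition_is_lossless(commons, privates):
    """Check that policy-then-quantizer reproduces the composite rule exactly."""
    return all(
        quantization_policy(c)(p) == composite_policy(c, p)
        for c, p in product(commons, privates)
    )

commons = [(1, 2), (2, 2)]      # illustrative past output sequences
privates = [(0,), (0, 1)]       # illustrative observation sequences
lossless = decomposition_is_lossless(commons, privates)
```

Because the receiver also knows the common information, it can reconstruct which quantizer was applied, which is what makes the quantizer a natural action variable.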

The receiver, upon receiving the information from the encoders, generates its decision at time $t$, also causally: an admissible causal receiver policy is a sequence of measurable functions $\gamma = \{\gamma_t\}$ such that

$$\gamma_t : \prod_{s=0}^{t} (\mathcal{M}_s^1 \times \mathcal{M}_s^2) \to \mathbb{U}, \quad t \geq 0,$$

where $\mathbb{U}$ denotes the decision space.

For a general vector $a$, let $\mathbf{a}$ denote $\{a^1, a^2\}$, let $\Pi = \{\Pi^1, \Pi^2\}$ denote the ensemble of policies, and let $\mathbf{Q}_t = \{Q_t^1, Q_t^2\}$. Hence, $\mathbf{q}_{[0,t]}$ denotes $\{q^1_{[0,t]}, q^2_{[0,t]}\}$.

With the above formulation, the objective of the decision makers is the following minimization problem:

$$\inf_{\Pi^{comp}} \inf_{\gamma} E^{\Pi^{comp},\gamma}_{\nu_0}\Big[\sum_{t=0}^{T-1} c(x_t, v_t)\Big]$$

over all policies $\Pi^{comp}$, $\gamma$ with the random initial condition $x_0$ having probability measure $\nu_0$. Here $c(\cdot, \cdot)$ is a non-negative, bounded, measurable function and $v_t = \gamma_t(\mathbf{q}_{[0,t]})$ for $t \geq 0$.

We also assume that the encoders and the receiver know the a priori distribution $\nu_0$.

Before concluding this section, it is worth emphasizing the operational nature of causality, as different approaches have been adopted in the literature. The encoders at any given time can only use their local information to generate the quantization outputs. The receiver, at any given time, can only use its local information to generate its decision/estimate. These happen with zero delay; that is, if there is a common clock at the encoders and the receiver, the receiver at time $t$ needs to make its decision before the realizations $x_{t+1}, y^1_{t+1}, y^2_{t+1}$ have taken place. This corresponds to the zero-delay coding schemes of, for example, Witsenhausen and Linder-Lugosi in [52], [29], but is different from the setup of Neuhoff and Gilbert [40], which allows long delays at the decoder. Our motivation for such zero-delay schemes comes mainly from applications in networked control systems, where sensors need to transmit information to controllers who need to act on a system; such systems cannot tolerate long delays, in particular when the source is open-loop unstable and disturbance exists in the evolution of the source.

A. Relevant literature 

Some related studies of the above setup include optimal control with multiple sensors, sequential decentralized hypothesis testing problems, and multi-access communications with feedback [3]. Related papers on real-time coding include the following: [40] established that the optimal causal encoder minimizing the data rate subject to a distortion for an i.i.d. sequence is memoryless. If the source is $k$th-order Markov, then the optimal causal fixed-rate coder minimizing any measurable distortion uses only the last $k$ source symbols, together with the current state at the receiver's memory [52]. Reference [46] considered the optimal causal coding problem of finite-state Markov sources over noisy channels with noiseless feedback. [43] and [37] considered optimal causal coding of Markov sources over noisy channels without feedback. [36] considered the optimal causal coding over a noisy channel with noisy feedback. Reference [30] considered the causal coding of stationary sources under a high-rate assumption.

Our paper is particularly related to the following two efforts in the literature. Borkar, Mitter and Tatikonda [11] study a related problem of coding of a partially observed Markov source; however, the construction for the encoders is restricted to take a particular form which uses the information at the decoder and the most recent observation at the encoder (not including the observation history). As another point of relevance with our paper, [11] regarded the actions as the quantizer functions, which we will discuss further. In contrast, the only restriction we have in this paper is causality, in the zero-delay sense. On the other hand, we do not claim the existence results that the authors in [11] are making. Another work in the literature which is related to ours is by Nayyar and Teneketzis [38], considering a multi-terminal setup. [38] considers decentralized coding of correlated sources when the encoders observe conditionally independent messages given a finitely valued random variable and obtains separation results for the optimal encoders. The paper also considers noisy channels. In our setup, there does not exist a finitely valued random variable which makes the observations at the encoders conditionally independent.

References [47] and [31] consider optimal causal variable-rate coding under side information and [55] considers optimal variable-rate causal coding under distortion constraints. The studies in [31] and [55] are in the context of real-time, zero-delay settings, whereas [47] considers causality in the sense of Neuhoff and Gilbert [40] as discussed in the previous section.

We will also obtain structural results for optimal decentralized coding of i.i.d. sources. There are algorithmic results
available in the literature when the encoders satisfy the optimal
structure obtained in the paper; important resources in this 
direction include OH, ED and EH). 

A parallel line of consideration which has a rate-distortion 
theoretic nature is on sequential rate-distortion, proposed in 
[42], and the feedforward setup, which has been investigated 
in El and ifTTl . 

Our work is also related to Witsenhausen's indirect rate 
distortion problem [51] (see also USD . We will observe that, 
the separation argument through the disconnection principle 
of [51] applies to our setting in a dynamic context. Further 
related papers include [6], [28]. 

In our paper, we also use ideas from team decision theory, 
see El, E3), |34|, E3 and US for related discussions and 
applications. 

B. Contributions of the paper 

* The optimal causal coding of a partially observed Markov 
source is considered. For the single terminal case, a struc- 
tural result is obtained extending Witsenhausen's and Wal- 
rand and Varaiya's structural results on optimal causal 
(zero-delay) coders to a partially observed setting and to 
sources which take values in a Polish space. We show 
that a separation result of a form involving the decoder's 
belief on the encoder's belief on the state is optimal. 

* The decentralized (multi-terminal) setup is also consid- 
ered. For the case where the source is an i.i.d. process, it 
is shown that the optimal decentralized causal coding of 
correlated observations problem admits a solution which 
is memoryless. For Markov sources, a counterexample to 
a natural separation conjecture is presented. The decen- 
tralized control concept of signaling is interpreted in the 
context of decentralized coding. 

* The results are applied to a Linear-Quadratic-Gaussian 
(LQG) estimation/optimization problem. The results above 
induce an optimality result for separation of estimation 
and quantization, where the estimation is obtained with 
a Kalman Filter and the filter output is quantized. 

We now summarize the rest of the paper. In Section II, we present our results on optimal coding of a partially observed Markov process when there is only one encoder. Section III discusses the decentralized setting for a multi-encoder setup, presents a counterexample for a separation conjecture, and provides a separation result when the source is memoryless. Section IV gives an application example on linear, Gaussian systems, and the paper ends with the concluding remarks of Section V. The proofs of the results are presented in the Appendix.

II. Single-Terminal Case: Optimal Causal Coding 
of a Partially Observed Markov Source 

A. Revisiting the single-terminal, fully observed case 

Let us revisit the single-encoder, fully observed case: in this setup, $y_t = x_t$ for all $t \geq 0$. There are two related approaches in the literature, as presented explicitly by Teneketzis in [43]: one adopted by Witsenhausen [52] and one by Walrand and Varaiya [46]. Reference [43] extended the setups to the more general context of non-feedback communication.

Theorem 2.1: [Witsenhausen [52]] Any (causal) composite quantization policy can be replaced, without any loss in performance, by one which only uses $x_t$ and $q_{[0,t-1]}$ at time $t \geq 1$. $\diamond$

Walrand and Varaiya considered sources living in a finite set, and obtained the following:

Theorem 2.2: [Walrand-Varaiya [46]] Any optimal (causal) composite quantization policy can be replaced, without any loss in performance, by one which only uses the conditional probability measure $P(x_t | q_{[0,t-1]})$, the state $x_t$, and the time information $t$, at time $t \geq 1$. $\diamond$

The difference between the structural results above is the following: in the setup suggested by Theorem 2.1, the encoder's memory space is not fixed and keeps expanding as the decision horizon in the optimization, $T - 1$, increases. In Theorem 2.2, the memory space of an optimal encoder is fixed. In general, the space of probability measures is a very large one; however, different quantization outputs may lead to the same conditional probability measure on the state process, leading to a reduction in the required memory. Furthermore, Theorem 2.2 allows one to apply the theory of Markov Decision Processes for infinite horizon problems. We note that [11] applied such machinery to obtain existence results for optimal causal coding of partially observed Markov processes.
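To make the fixed-memory statistic of Theorem 2.2 concrete, the following sketch propagates the conditional measure $P(x_t | q_{[0,t-1]})$ for a finite-state, fully observed source. It is a standard Bayes-then-predict step under stated assumptions: the state space is $\{0, \ldots, n-1\}$, the quantizer is a lookup table known to the receiver, and the names `predict_after_output`, `quantizer` and `transition` are illustrative.

```python
from fractions import Fraction

def predict_after_output(belief, quantizer, q, transition):
    """Update P(x_{t-1} | q_[0,t-2]) to P(x_t | q_[0,t-1]) on a finite state space.

    belief[x]       : prior probability of state x before the new output
    quantizer[x]    : output symbol assigned to state x (assumed known to the receiver)
    q               : the output symbol actually received
    transition[x][y]: P(x_t = y | x_{t-1} = x)
    """
    n = len(belief)
    # Condition on the event {quantizer(x_{t-1}) = q}.
    weights = [belief[x] if quantizer[x] == q else Fraction(0) for x in range(n)]
    total = sum(weights)
    if total == 0:
        raise ValueError("received symbol has zero probability under the belief")
    posterior = [w / total for w in weights]
    # Push the posterior through the Markov kernel.
    return [sum(posterior[x] * transition[x][y] for x in range(n)) for y in range(n)]

half = Fraction(1, 2)
belief = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)]
quantizer = [1, 1, 2]
transition = [[half, half, 0], [0, half, half], [half, 0, half]]
next_belief = predict_after_output(belief, quantizer, 1, transition)
```

The encoder and the receiver can both run this recursion, so the belief is common information regardless of how long the horizon is.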

B. Optimal causal coding of a partially observed Markov source 

Consider the setup earlier in (1)-(2) with a single encoder. Thus, the system considered is a discrete-time scalar system described by

$$x_{t+1} = f(x_t, w_t), \qquad y_t = g(x_t, r_t), \tag{5}$$

where $x_t$ is the state at time $t$, and $\{w_t, r_t\}$ is a sequence of zero-mean, mutually independent, identically distributed (i.i.d.) random variables with finite second moments. Let the quantizer, as described earlier, map its information to a finite set $\mathcal{M}_t$. At any given time, the receiver generates a quantity $v_t$ as a function of its received information, that is, as a function of $\{q_0, q_1, \ldots, q_t\}$. The goal is to minimize $\sum_{t=0}^{T-1} E[c(x_t, v_t)]$, subject to constraints on the number of quantizer bins in $\mathcal{M}_t$, and the causality restriction in encoding and decoding.
For a Polish space $\mathbb{S}$, let $\mathcal{P}(\mathbb{S})$ be the space of probability measures on $\mathcal{B}(\mathbb{S})$, the Borel $\sigma$-field on $\mathbb{S}$ (generated by open sets in $\mathbb{S}$). At this point, we pause to provide a brief discussion on the space $\mathcal{P}(\mathbb{S})$.

Let $\Gamma(\mathbb{S})$ be the set of all Borel measurable and bounded functions from $\mathbb{S}$ to $\mathbb{R}$. We first wish to find a topology on $\mathcal{P}(\mathbb{S})$ under which functions of the form

$$\Theta := \Big\{ \int_{x \in \mathbb{S}} \pi(dx) f(x), \ f \in \Gamma(\mathbb{S}) \Big\}$$

are measurable on $\mathcal{P}(\mathbb{S})$. We will need this to construct the structure of optimal quantizers later in this section.

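Let $\{\mu_n, n \in \mathbb{N}\}$ be a sequence in $\mathcal{P}(\mathbb{S})$. Recall that $\{\mu_n\}$ is said to converge to $\mu \in \mathcal{P}(\mathbb{S})$ weakly if

$$\int_{\mathbb{S}} c(x)\,\mu_n(dx) \to \int_{\mathbb{S}} c(x)\,\mu(dx)$$

for every continuous and bounded $c : \mathbb{S} \to \mathbb{R}$. The sequence $\{\mu_n\}$ is said to converge to $\mu \in \mathcal{P}(\mathbb{S})$ setwise if

$$\int_{\mathbb{S}} c(x)\,\mu_n(dx) \to \int_{\mathbb{S}} c(x)\,\mu(dx)$$

for every measurable and bounded $c : \mathbb{S} \to \mathbb{R}$. For two probability measures $\mu, \nu \in \mathcal{P}(\mathbb{S})$, the total variation metric is given by

$$\|\mu - \nu\|_{TV} = 2 \sup_{B \in \mathcal{B}(\mathbb{S})} |\mu(B) - \nu(B)| = \sup_{f : \|f\|_\infty \leq 1} \Big| \int f(x)\,\mu(dx) - \int f(x)\,\nu(dx) \Big|,$$

where the supremum is over all measurable real $f$ such that $\|f\|_\infty = \sup_{x \in \mathbb{S}} |f(x)| \leq 1$. A sequence $\{\mu_n\}$ is said to converge to $\mu \in \mathcal{P}(\mathbb{S})$ in total variation if $\|\mu_n - \mu\|_{TV} \to 0$.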
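On a finite space the two expressions for the total variation metric can be checked directly, and both reduce to the $\ell_1$ distance between the probability vectors. The sketch below is only a sanity check under that finite-space assumption; the function names are illustrative.

```python
from fractions import Fraction
from itertools import combinations

def tv_by_sets(mu, nu):
    """Compute 2 * sup_B |mu(B) - nu(B)| by brute force over all subsets B."""
    n = len(mu)
    best = Fraction(0)
    for r in range(n + 1):
        for subset in combinations(range(n), r):
            gap = abs(sum(mu[i] for i in subset) - sum(nu[i] for i in subset))
            best = max(best, gap)
    return 2 * best

def tv_by_l1(mu, nu):
    """Compute sum_x |mu(x) - nu(x)|, attained by the test function f = sign(mu - nu)."""
    return sum(abs(m - v) for m, v in zip(mu, nu))

mu = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)]
nu = [Fraction(1, 4), Fraction(1, 4), Fraction(1, 2)]
tv_sets = tv_by_sets(mu, nu)
tv_l1 = tv_by_l1(mu, nu)
```

The brute-force search is exponential in the size of the space and is meant only to illustrate the definition.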

These three convergence notions are in increasing order of 
strength: convergence in total variation implies setwise conver- 
gence, which in turn implies weak convergence. Total variation 
is a very strong notion for convergence. Furthermore, the space 
of probability measures under total variation metric is not 
separable. Setwise convergence also induces an inconvenient 
topology on the space of probability measures, particularly 
because this topology is not metrizable ([20, p. 59]). However, 
the space of probability measures on a complete, separable, 
metric (Polish) space endowed with the topology of weak 
convergence is itself a complete, separable, metric space 2). 
The Prohorov metric, for example, can be used to metrize this 
space, among other metrics. This topology has found many 
applications in information theory and stochastic control. For 
these reasons, one would like to work with weak convergence. 

By the above definitions, it is evident that both setwise convergence and total variation are sufficient for measurability of the function class $\Theta$, since under these topologies $\int \pi(dx) f(x)$ is (sequentially) continuous on $\mathcal{P}(\mathbb{S})$ for every $f \in \Gamma(\mathbb{S})$. However, as we state in the following, weak convergence is also sufficient (see Theorem 15.13 in [1] or p. 215 in [8]).

Theorem 2.3: Let $\mathbb{S}$ be a Polish space and let $M(\mathbb{S})$ be the set of all measurable and bounded functions $f : \mathbb{S} \to \mathbb{R}$ under which

$$\int \pi(dx) f(x)$$

defines a measurable function on $\mathcal{P}(\mathbb{S})$ under the weak convergence topology. Then, $M(\mathbb{S})$ is the same as the set of all bounded and measurable functions. $\diamond$
Hence, $\mathcal{P}(\mathbb{S})$ will denote the space of probability measures on $\mathbb{S}$ under weak convergence. Now, define $\pi_t \in \mathcal{P}(\mathbb{X})$ to be the regular conditional probability measure given by

$$\pi_t(A) = P(x_t \in A \,|\, y_{[0,t]}), \quad A \in \mathcal{B}(\mathbb{X}).$$

The existence of this regular conditional probability measure for every realization $y_{[0,t]}$ follows from the fact that both the state space and the observation space are Polish. It is 
known that the process {ir t } evolves according to a non-linear 
filtering equation (see (TBI), and is itself a Markov process (see 

did. ma, id i). 

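Let us also define $\Xi_t \in \mathcal{P}(\mathcal{P}(\mathbb{X}))$ as the regular conditional measure

$$\Xi_t(A) = P(\pi_t \in A \,|\, q_{[0,t-1]}), \quad A \in \mathcal{B}(\mathcal{P}(\mathbb{X})).$$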
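For intuition on the filter process $\{\pi_t\}$, the sketch below writes one step of the filtering recursion for a finite-state, finite-observation model. This is the standard hidden Markov model filter and is offered only as an illustration under those finiteness assumptions; the general Polish-space recursion is not reproduced here, and the names `filter_update`, `transition` and `likelihood` are illustrative.

```python
from fractions import Fraction

def filter_update(belief, transition, likelihood, y):
    """One filter step: from pi_t = P(x_t | y_[0,t]) to pi_{t+1} given y_{t+1} = y.

    transition[x][z]: P(x_{t+1} = z | x_t = x)
    likelihood[z][y]: P(y_{t+1} = y | x_{t+1} = z)
    """
    n = len(belief)
    # Prediction through the Markov kernel.
    predicted = [sum(belief[x] * transition[x][z] for x in range(n)) for z in range(n)]
    # Correction by the observation likelihood, then normalization.
    unnormalized = [predicted[z] * likelihood[z][y] for z in range(n)]
    total = sum(unnormalized)
    if total == 0:
        raise ValueError("observation has zero probability under the model")
    return [u / total for u in unnormalized]

a, b = Fraction(3, 4), Fraction(1, 4)
transition = [[a, b], [b, a]]
likelihood = [[a, b], [b, a]]
pi_t = [Fraction(1, 2), Fraction(1, 2)]
pi_next = filter_update(pi_t, transition, likelihood, 0)
```

The update depends on the past only through the current belief and the new observation, which is the sense in which the filter is itself a Markov process.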

The following are the main results of this section:

Theorem 2.4: Any (causal) composite quantization policy can be replaced, without any loss in performance, by one which only uses $\{\pi_t, q_{[0,t-1]}\}$ as a sufficient statistic for $t \geq 1$. This can be expressed as a quantization policy which only uses $q_{[0,t-1]}$ to generate a quantizer, where the quantizer uses $\pi_t$ to generate the quantization output at time $t$. $\diamond$

Theorem 2.5: Any optimal (causal) composite quantization policy can be replaced, without any loss in performance, by one which only uses $\{\Xi_t, \pi_t, t\}$ for $t \geq 1$. This can be expressed as an optimal quantization policy which only uses $\{\Xi_t, t\}$ to generate an optimal quantizer, where the quantizer uses $\pi_t$ to generate the quantization output at time $t$. $\diamond$

The proofs of the results above are presented in the Appendix. We present two remarks in the following:

Remark 2.1: Our results above are not surprising. In fact, once one recognizes that $\{\pi_t\}$ forms a Markov source, and that the cost function can be expressed as a function $\tilde{c}(\pi_t, v_t)$ for some function $\tilde{c} : \mathcal{P}(\mathbb{X}) \times \mathbb{U} \to \mathbb{R}$, one could almost directly apply Witsenhausen's [52] as well as Walrand and Varaiya's [46] results to recover the structural results above (except for the fact that Walrand and Varaiya consider sources living in a finite alphabet). The proofs in the Appendix are presented for completeness and to address the technical intricacies. $\diamond$
Remark 2.2: Having the actions as the quantizers, and not the quantizer outputs, allows one to define a Markov Decision Problem with well-defined cost functions and state and action spaces. By the proof of Theorem 2.5, we will observe that $(\Xi_t, Q_t)$ forms a Markov chain, which is a key observation. Thus, the action space can be constructed as some topological space of quantizers acting on $\mathcal{P}(\mathbb{X})$. Borkar, Mitter and Tatikonda [11] adopted this view while formulating an MDP optimization problem, where the quantizer acts on $\mathbb{Y}$. (As mentioned earlier, our separation result is different from [11] due to the structure imposed on the quantizers in [11].) See also [58] for a topology on quantizers. $\diamond$
C. Extensions to finite delay decoding and higher order Markov sources

The results presented are also generalizable to settings where a) the source is Markov of order $m > 0$, b) a finite delay $d$ is allowed at the decoder, and c) the observation process depends also on past source outputs in a sense described in (6) below. For these cases, we consider the following generalization of the source by expanding the state space.

Suppose that the partially observed source is such that the source is Markov of order $m$, or there is a finite delay $d \geq 0$ which is allowed at the decoder. In this case, we can augment the source to obtain $z_t = \{x_{[t - \max(d+1, m) + 1, t]}\}$. Note that $\{z_t\}$ is Markov. We can thus consider the following representation:

$$z_{t+1} = f(z_t, w_t), \qquad y_t = g(z_t, r_t), \tag{6}$$

where $z_t = \{x_{[t - \max(d+1, m) + 1, t]}\} \in \mathbb{X}^{\max(d+1, m)}$, and $r_t, w_t$ are mutually independent, i.i.d. processes.
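The state augmentation can be pictured as a sliding window over the source sequence. The sketch below builds the windows $z_t$ of length $\max(d+1, m)$ from a finite sample path; the function name `augment` and the sample values are illustrative assumptions, and windows are only formed once enough past symbols exist.

```python
def augment(xs, m, d):
    """Return the augmented states z_t = (x_{t-k+1}, ..., x_t), k = max(d + 1, m).

    The first window corresponds to t = k - 1, the earliest time at which
    all k symbols are available.
    """
    k = max(d + 1, m)
    return [tuple(xs[t - k + 1 : t + 1]) for t in range(k - 1, len(xs))]

xs = [0, 1, 2, 3, 4]          # illustrative sample path x_0, ..., x_4
windows = augment(xs, m=2, d=2)
# With delay d, the delayed symbol x_{t-d} is the entry at position -(d + 1) of z_t.
delayed = [z[-(2 + 1)] for z in windows]
```

Since each window contains both the last $m$ symbols and the symbol $x_{t-d}$, a cost on the delayed or higher-order source becomes a cost on the current augmented state.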

Any per-stage cost function of the form $c(x_t, v_t)$ can be written as $\tilde{c}(z_t, v_t)$ for some $\tilde{c}$. For the finite delay case, the per-stage cost can further be specialized as $c(x_{t-d}, v_t)$. For the Markov case with memory, the per-stage cost function is written as $c(x_{[t-m+1, t]}, v_t)$. Now, by replacing $\mathbb{X}$ with $\mathbb{X}^{\max(d+1, m)}$, let $\pi_t \in \mathcal{P}(\mathbb{X}^{\max(d+1, m)})$ be given by

$$\pi_t(A) = P(z_t \in A \,|\, y_{[0,t]}), \quad A \in \mathcal{B}(\mathbb{X}^{\max(d+1, m)}),$$

and let $\Xi_t \in \mathcal{P}(\mathcal{P}(\mathbb{X}^{\max(d+1, m)}))$ be the regular conditional measure defined by

$$\Xi_t(A) = P(\pi_t \in A \,|\, q_{[0,t-1]}), \quad A \in \mathcal{B}(\mathcal{P}(\mathbb{X}^{\max(d+1, m)})).$$

Hence, we have the following result, which is a direct extension of Theorems 2.4 and 2.5.

Theorem 2.6: Suppose that the partially observed source is such that the source is Markov of order $m$, or there is a finite delay $d \geq 0$ which is allowed at the decoder. With $z_t = \{x_{[t - \max(d+1, m) + 1, t]}\}$, suppose $y_t$ satisfies (6). Then, we have the following extensions:

i) Any causal composite quantization policy can be replaced, without any loss in performance, by one which only uses $\{\pi_t, q_{[0,t-1]}\}$ as a sufficient statistic for $t \geq 1$. This can be expressed as a quantization policy which only uses $q_{[0,t-1]}$ to generate a quantizer, where the quantizer uses $\pi_t$ to generate the quantization output at time $t$.

ii) Any optimal causal composite quantization policy can be replaced, without any loss in performance, by one which only uses $\{\Xi_t, \pi_t, t\}$ for $t \geq 1$. This can be expressed as an optimal quantization policy which only uses $\{\Xi_t, t\}$ to generate an optimal quantizer, where the quantizer uses $\pi_t$ to generate the quantization output at time $t$. $\diamond$

For a further case where the decoder's memory is limited or imperfect, the results apply by replacing the full information at the receiver considered so far in our analysis with the limited memory, under additional technical assumptions on the decoder's update of its memory (in particular, (15) in the proof of Theorem 2.5 does not apply in general). However, a result equivalent to Theorem 2.4 applies also for the limited memory setting. Such memory settings have been considered 
in |E2, lED and (36). 



III. Multi-Terminal (Decentralized) Setup 

A. Case with memoryless sources 

Let us first consider a special, but important, case when $\{x_t, t \geq 0\}$ is an independent and identically distributed (i.i.d.) sequence. Further, suppose that the observations are given by

$$y_t^i = g^i(x_t, r_t^i), \tag{7}$$

for measurable functions $g^i$, $i = 1, 2$, with $\{r_t^1, r_t^2\}$ (across time) an i.i.d. noise process. We do not require that $r_t^1$ and $r_t^2$ are independent for a given $t$. We note that our result below is also applicable when the process $\{r_t^1, r_t^2\}$ is only independent (across time), but not necessarily identically distributed.

One difference with the general setup considered earlier in Section I is that we require the observation spaces $\mathbb{Y}^i$, $i = 1, 2$, to be finite spaces ($\mathbb{X}$ can still be Polish).

Suppose the goal is again the minimization problem

$$\inf_{\Pi^{comp}} \inf_{\gamma} E^{\Pi^{comp},\gamma}_{\nu_0}\Big[\sum_{t=0}^{T-1} c(x_t, v_t)\Big], \tag{8}$$

over all causal coding and receiver decision policies.

We now make a definition. In the following, $1_{\{E\}}$ denotes the indicator function of an event $E$.

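Definition 3.1: We define the class of non-stationary memoryless team policies at $t \geq 0$ as follows:

$$\Pi^{NSM} := \Big\{ \Pi^{comp} : P(\mathbf{q}_t \,|\, \mathbf{y}_{[0,t]}) = P(q_t^1 \,|\, y_t^1, t)\, P(q_t^2 \,|\, y_t^2, t) = 1_{\{q_t^1 = Q_t^1(y_t^1)\}}\, 1_{\{q_t^2 = Q_t^2(y_t^2)\}}, \ Q_t^i : \mathbb{Y}^i \to \mathcal{M}_t^i, \ i = 1, 2, \ t \geq 0 \Big\}, \tag{9}$$

where, in the above, $\{Q_t^1, Q_t^2\}$ are arbitrary measurable functions.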

Theorem 3.1: Consider the minimization of (8). An optimal composite quantization policy over all causal policies is an element of $\Pi^{NSM}$. Such a policy exists. $\diamond$

The proof is presented in the Appendix. 

Hence, an optimal composite quantization policy only uses the product form admitted by a non-stationary memoryless team policy. It ignores the past observations and past quantization outputs. We note that the proof also applies to the case when the source is memoryless, but not necessarily i.i.d. One may ask why feedback could be useful when the source is i.i.d. Feedback may be useful for at least two reasons: (i) feedback can be used as a signaling mechanism for the encoders to communicate with each other (which we discuss further in the next subsections), and (ii) feedback can provide common randomness to allow a convexification of the space of possibly randomized decentralized encoding strategies. Consider the optimization problem discussed in (8):

$$J(\Pi^{comp}) = \inf_{\gamma} E^{\Pi^{comp},\gamma}_{\nu_0}\Big[\sum_{t=0}^{T-1} c(x_t, v_t)\Big].$$

The function $J(\Pi^{comp})$ is concave in the choice of a team policy $\Pi^{comp}$ (see Theorem 4.1 in [57] for the case with $T = 1$; the proof also holds for the current setting). As such, if the space of joint encoding policies is convexified by common randomness, an optimal solution would exist at an extreme point, which in turn does not require the use of common randomness. This explains why common randomness generated by past quantization outputs does not present further benefit in the current setting.

We note before ending this subsection that if there is an entropy constraint on the quantizer outputs, then feedback might be useful for finite horizon problems as it provides common randomness, which cannot be achieved by time-sharing in a finite-horizon problem. [40] observed that randomization of two scalar quantizers (operationally achievable through time-sharing) is optimal in causal coding of an i.i.d. source subject to distortion constraints, which also applies in the side information setting of [47]. On the other hand, for the zero-delay setting, when one considers the distortion minimization problem subject to an entropy constraint, [25] observed that the distortion-entropy curve is non-convex, leading to a benefit of common randomness for achieving points in the lower convex hull of this curve. Further relevant discussions on randomization and optimal quantizer design are present in [19] and [56].

B. Case with Markov sources: A counterexample with signaling

We now consider Markov sources and exhibit that it is, in general, not possible to obtain a separation result of the form presented for the single-terminal case.

We will consider a two-encoder setup for the following result, where the encoders have access to the feedback from the receiver (Fig. 1). We have the following result.

Proposition 3.1: Consider the setup in (1)-(2) and let $\pi_t^i = P(x_t \,|\, y^i_{[0,t]})$, $x_t \in \mathbb{X}$, $i = 1, 2$. An optimal composite quantization policy cannot, in general, be replaced by a policy which only uses $\{\mathbf{q}_{[0,t-1]}, \pi_t^i\}$ to generate $q_t^i$ for $i = 1, 2$. $\diamond$

Proof. It suffices to produce an instance where an optimal policy cannot admit the separated structure. Toward this end, let $z_1, z_2, z_3$ be uniformly distributed, independent, binary numbers, and let $x_0, x_1$ be defined by

$$x_0 = \begin{bmatrix} z_1 \\ z_2 \\ 0 \\ 0 \end{bmatrix}, \qquad x_1 = \begin{bmatrix} 0 \\ 0 \\ z_2 \\ z_3 \end{bmatrix},$$

such that $x_0(1) = z_1$, $x_0(2) = z_2$, $x_0(3) = x_0(4) = 0$. Let the observations be given as follows:

$$y_t^1 = g^1(x_t) = x_t(1) \oplus x_t(3) \oplus x_t(4),$$

$$y_t^2 = g^2(x_t) = x_t(1) \oplus x_t(2), \quad t = 0, 1,$$

where $\oplus$ is the x-or operation. That is,

$$y_0^1 = [z_1], \quad y_0^2 = [z_1 \oplus z_2], \quad y_1^1 = [z_2 \oplus z_3], \quad y_1^2 = [0].$$

Let the cost be:

$$E\Big[\big(x_0(4) - E[x_0(4) \,|\, \mathbf{q}_{[0,0]}]\big)^2 + \big(x_1(4) - E[x_1(4) \,|\, \mathbf{q}_{[0,1]}]\big)^2\Big].$$

That is, the cost is $E[(z_3 - E[z_3 \,|\, \mathbf{q}_{[0,1]}])^2]$, where $\mathbf{q}_t$ are the information bits sent to the decoder for $t = 0$ and $1$.

We further restrict the information rates to satisfy $|\mathcal{M}_0^1| = |\mathcal{M}_1^1| = |\mathcal{M}_1^2| = 2$, $|\mathcal{M}_0^2| = 1$. That is, encoder 2 may only send information at time $t = 1$.

Under arbitrary causal composite quantization policies, a cost of zero can be achieved as follows: if encoder 1 sends the value $z_1$ to the receiver at time 0, and at time 1 encoder 1 transmits $z_2 \oplus z_3$ and encoder 2 transmits $z_2$ (or $z_1 \oplus z_2$), the receiver can uniquely identify the value of $z_3$ for every realization of the random variables.

For such a source, an optimal composite policy cannot be written in the separated form; that is, an optimal policy of encoder 2 at time 1 cannot be written as $h_1(q_0, \pi_1^2)$ for some measurable function $h_1$. To see this, note the following: the conditional distribution on $x_1$ at encoder 2 at time 1 is such that the conditional measure on $(z_2, z_3)$ is uniform and independent, that is, $P(z_2 = a, z_3 = b \,|\, z_1 \oplus z_2) = 1/4$ for all values of $a, b$. If a policy of the structure of $h_1$ is adopted, then it is not possible for encoder 2 to recall its past observation to extract the value of $z_2$. This is because $\pi_1^2$ will be a distribution only on $z_2$ and $z_3$, which will be uniform and independent, given $z_1 \oplus z_2$. Thus, the information $y_0^2$ will not be available in the memory, and the receiver will have access to at most $z_2 \oplus z_3$, $z_1$ and $P(z_2, z_3 \,|\, z_1 \oplus z_2)$ (the last variable containing no useful information). The optimal estimator will be $E[z_3] = 1/2$, leading to a cost of $1/4$.

$\diamond$
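The two costs in the proof can be checked by exhaustive enumeration over the eight equally likely realizations of the three bits. The sketch below computes the minimum mean-square error of the third bit given the symbols received under each scheme; the function names are illustrative, and the "separated" scheme models encoder 2's output as a function of the fed-back bit alone, since its conditional measure is the same for every realization.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

def mmse_of_z3(received):
    """Mean-square error of the conditional-mean estimate of z3 given the received tuple."""
    groups = defaultdict(list)
    for z1, z2, z3 in product((0, 1), repeat=3):
        groups[received(z1, z2, z3)].append(z3)
    total = Fraction(0)
    for values in groups.values():
        mean = Fraction(sum(values), len(values))
        total += sum((v - mean) ** 2 for v in values)
    return total / 8  # each realization has probability 1/8

def with_memory(z1, z2, z3):
    """Encoder 2 recalls y_0^2 = z1 xor z2 and combines it with the fed-back bit z1."""
    q1_0 = z1
    q1_1 = z2 ^ z3
    q2_1 = (z1 ^ z2) ^ q1_0  # equals z2
    return (q1_0, q1_1, q2_1)

def separated(z1, z2, z3):
    """Encoder 2 uses only the fed-back bit; its belief carries no realization-specific data."""
    q1_0 = z1
    q1_1 = z2 ^ z3
    q2_1 = q1_0
    return (q1_0, q1_1, q2_1)

cost_with_memory = mmse_of_z3(with_memory)
cost_separated = mmse_of_z3(separated)
```

Replacing the last output in `separated` by any other function of the fed-back bit leaves the receiver's information unchanged, so the gap between the two costs does not depend on that particular choice.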

C. Discussion: Connections with team decision theory 

In this subsection, we interpret the results of the previous
subsections. We first provide a brief discussion on information
structures in a decentralized optimization problem. Consider a
collection of decision makers (DMs), each of which has access to
some local information variable. Such a collection of decision
makers who wish to minimize a common cost function and
who agree on the system (that is, the probability
space on which the system is defined, and the policy and
action spaces) is said to be a team. Such a team is dynamic if
the information of one DM is affected by the policy of some
other DM. If there is a pre-specified order of action for the
DMs, the team is said to be sequential. Witsenhausen [50]
provided the following characterization of information structures
in a dynamic sequential team: Under a centralized information
structure, all DMs have the same information. If the information
of a DM, say DM$^j$, is dependent on the policy of another DM,
say DM$^k$, and DM$^j$ does not have access to the information
available to DM$^k$, the information structure is said to be
non-classical. A decentralized system admits a quasi-classical
information structure if it is not non-classical.

In a decentralized optimization problem, when the informa- 
tion structure is non-classical, DMs might choose to commu- 
nicate via their control actions: This is known as signaling in 
decentralized control (see for example [54]).

With the characterization of information structures above,
every lossy coding problem is non-classical, since a receiver
cannot fully recover the information available at the encoder,
while its information is clearly affected by the coding policy
of the encoder. However, in an encoding problem, the problem
itself is the transmission of information. Therefore, we
suggest the following: Signaling in a coding problem is the
policy of an encoder to use the quantizers/encoding functions
to transmit a message to other decision makers, or to itself
to be used in future stages, through the information sent to
the receiver. In the information theory literature, signaling has
been employed in coding for Multiple Access Channels with
feedback in [14], [12] and [44]. In these papers, the authors
used active information transmission to allow for coordination
between encoders.

The reason for the negative conclusion in Proposition 3.1 is
that in general for an optimal policy,

$$P(q_t \mid \pi_t, q_{[0,t-1]}, y_{[0,t-1]}) \neq P(q_t \mid \pi_t, q_{[0,t-1]}), \qquad (10)$$

when the encoders have engaged in signaling (in contrast with
what we will have in the proof of the separation results). The
encoders may benefit from using the received past observation
variables explicitly.

Separation results for such dynamic team problems typically
require information sharing between the encoders (decision
makers), where the shared information is used to establish
a sufficient statistic living in a fixed state space and which
admits a controlled Markov recursion (hence, such a sufficient
statistic can serve as a state for the decentralized system).
For the proof of Theorem 2.5, we see that $\Xi_t$ forms such a
state. For the proof of Theorem 3.1, we see that information
sharing is not needed for the encoders to agree on a sufficient
statistic, since the source considered is memoryless.
Furthermore, for the multi-terminal setting with a Markov source, a
careful analysis of the proof of Theorem 3.1 (see (21), (24) and
the subsequent discussion) reveals that if the encoders agree
on $P(dx_t \mid \mathbf{y}_{[0,t-1]})$ through sharing their beliefs for $t > 1$,
then a separation result involving this joint belief can be
obtained. See [54] for a related information sharing pattern and
discussions. Further results on such a dynamic programming 
approach to dynamic team problems are present in 0, II32L 
11391 , 11351 , among other references. 

Remark 3.1: In the context of multi-terminal systems, for
the computation of the capacity of Multiple Access Channels
with memory and partial state feedback at the encoders, a
relevant discussion has been reported in [13] (Section V). This
is in the same spirit as our current paper in that [13] obtains
an optimality result when the channel is memoryless, and
points out the difficulties arising in the case of channels with
memory in view of intractability of the optimization problem:
One cannot identify a finite dimensional sufficient statistic for
the encoders to use. Along a relevant direction, Section III.D
of [38], in the context of real-time coding, discusses the issue
of the growing state space. ⋄

IV. Application to Linear Quadratic Gaussian 
(LQG) Estimation Problems 

Consider a Linear Quadratic Gaussian setup, where a sensor
quantizes its noisy information to an estimator. Let $x_t \in \mathbb{R}^n$,
$y_t \in \mathbb{R}^m$, and the evolution of the system be given by the
following:

$$x_{t+1} = A x_t + w_t,$$
$$y_t = C x_t + r_t. \qquad (11)$$

Here, $\{w_t, r_t\}$ is a mutually independent, zero-mean Gaussian
noise sequence with $W = E[w_t w_t']$, $R = E[r_t r_t']$ (where for a
vector $w$, $w'$ denotes its transpose), $x_0$ is a zero-mean Gaussian
variable, and $A$, $C$ are matrices of appropriate dimensions.
Suppose the goal is the computation of

$$\inf_{\Pi^{comp}} \inf_{\gamma} E^{\Pi^{comp},\gamma}_{\nu_0}\Big[\sum_{t=0}^{T-1} (x_t - v_t)' Q (x_t - v_t)\Big], \qquad (12)$$

with $\nu_0$ denoting a Gaussian distribution for the initial state and
$Q > 0$ a positive definite matrix (see Figure 2).

The conditional measure $\pi_t = P(dx_t \mid y_{[0,t]})$ is Gaussian
for all time stages, and is thus characterized uniquely by its
mean and covariance matrix. We have the following.

Theorem 4.1: For the minimization of the cost in (12), any
causal composite quantization policy can be replaced, without
any loss in performance, by one which only uses the output of
the Kalman Filter and the information available at the receiver.
⋄

Proof. The result can be proven by considering a direct
approach, rather than as an application of Theorems 2.4 and
2.5 (which require bounded costs; however, this assumption
can be relaxed for this case), exploiting the specific quadratic
nature of the problem. Let $\|\cdot\|_Q$ denote the norm generated by
an inner product of the form $\langle x, y \rangle = x'Qy$ for $x, y \in \mathbb{R}^n$ and
positive-definite $Q > 0$. The Projection Theorem for Hilbert
Spaces implies that the random variable $x_t - E[x_t \mid y_{[0,t]}]$ is
orthogonal to the random variables $\{y_{[0,t]}, q_{[0,t]}\}$, where $q_{[0,t]}$
is included due to the Markov chain condition

$$P(dx_t \mid y_{[0,t]}, q_{[0,t]}) = P(dx_t \mid y_{[0,t]}).$$

We thus obtain the following identity.

$$E\big[\|x_t - E[x_t \mid q_{[0,t]}]\|_Q^2\big] = E\big[\|x_t - E[x_t \mid y_{[0,t]}]\|_Q^2\big] + E\Big[\big\|E[x_t \mid y_{[0,t]}] - E\big[E[x_t \mid y_{[0,t]}] \,\big|\, q_{[0,t]}\big]\big\|_Q^2\Big]$$
The second term is to be minimized through the choice of
the quantizers. Hence, the term $\tilde{m}_t := E[x_t \mid y_{[0,t]}]$, which is
computed through a Kalman Filter, is to be quantized (see
Figure 2). Recall that by the Kalman Filter (see [33]), with

$$\Sigma_{0|-1} = E[x_0 x_0']$$

and for $t \geq 0$,

$$\Sigma_{t+1|t} = A\Sigma_{t|t-1}A' + W - (A\Sigma_{t|t-1}C')(C\Sigma_{t|t-1}C' + R)^{-1}(C\Sigma_{t|t-1}A'),$$

the following recursion holds for $t \geq 0$ and with $\tilde{m}_{-1} = 0$:

$$\tilde{m}_t = A\tilde{m}_{t-1} + \Sigma_{t|t-1}C'(C\Sigma_{t|t-1}C' + R)^{-1}\big(CA(x_{t-1} - \tilde{m}_{t-1}) + r_t\big).$$

Thus, the pair $(\tilde{m}_t, \Sigma_{t|t-1})$ is a Markov source, where the
evolution of $\Sigma_{t|t-1}$ is deterministic. Even though the cost to be
minimized is not bounded, since $\tilde{m}_t$ itself is a fully observed
process, the proof of Theorem 2.4 can be modified to develop
the structural result that any causal encoder can be replaced
with one which uses $(\tilde{m}_t, \Sigma_{t|t-1})$ and the past quantization
outputs (this result can be proven also using Theorem 1 of
Witsenhausen [52], since this source is fully observed by the
encoder). Likewise, the proof of Theorem 2.5 shows that, for
the fully observed Markov source $(\tilde{m}_t, \Sigma_{t|t-1})$, any causal
coder can be replaced with one which only uses the conditional
probability on $\tilde{m}_t$ and the realization $(\tilde{m}_t, \Sigma_{t|t-1}, t)$ at time $t$.
⋄
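The orthogonal decomposition used in the proof can be checked numerically on a small discrete model. The sketch below is illustrative only: it assumes a scalar source with $Q = 1$, a hypothetical list of equally likely $(x, y)$ pairs, and a hypothetical quantizer `quantize` acting on the observation. Because $q$ is a function of $y$, $E[E[x \mid y] \mid q] = E[x \mid q]$.

```python
from collections import defaultdict
from fractions import Fraction

# Hypothetical toy model: equally likely (x, y) pairs.
SAMPLES = [(0, 0), (2, 0), (1, 1), (5, 1), (4, 2), (8, 2), (7, 3), (9, 3)]


def quantize(y):
    """Illustrative two-level quantizer of the observation."""
    return y // 2


def conditional_means(key):
    """Map each value of key(y) to the conditional mean of x given that value."""
    sums = defaultdict(lambda: [Fraction(0), 0])
    for x, y in SAMPLES:
        entry = sums[key(y)]
        entry[0] += x
        entry[1] += 1
    return {k: total / count for k, (total, count) in sums.items()}


mean_given_y = conditional_means(lambda y: y)
mean_given_q = conditional_means(quantize)
n = len(SAMPLES)

# Left-hand side: error of the estimate based on the quantizer output only.
lhs = sum((x - mean_given_q[quantize(y)]) ** 2 for x, y in SAMPLES) / n
# First term: filtering error, which no quantizer can change.
filtering_error = sum((x - mean_given_y[y]) ** 2 for x, y in SAMPLES) / n
# Second term: error in reproducing E[x | y] from the quantizer output.
quantization_error = sum(
    (mean_given_y[y] - mean_given_q[quantize(y)]) ** 2 for x, y in SAMPLES
) / n
```

In this toy model the filtering error is fixed by the observation channel, so only `quantization_error` depends on the choice of `quantize`, which mirrors the role of the second term in the identity.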

Thus, the optimality of Kalman Filtering allows the encoder
to only use the conditional estimate and the error covariance
matrix without any loss of optimality (see Figure 2), and the
optimal quantization problem also has an explicit formulation.
The above result is related to findings in [15] (also see [5] and
[18]), and partially improves them in the direction of Markov
sources.
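The deterministic evolution of the error covariance can be illustrated in the scalar case. The sketch below is a minimal illustration, not the paper's implementation: it assumes scalar parameters `a`, `c`, `w`, `r` standing in for $A$, $C$, $W$, $R$, and the function names are hypothetical. The recursion never touches the observations, which is why the encoder and the receiver can both compute $\Sigma_{t|t-1}$ offline.

```python
def riccati_step(sigma, a, c, w, r):
    """One step of the scalar recursion for Sigma_{t+1|t} given Sigma_{t|t-1}."""
    correction = (a * sigma * c) ** 2 / (c * sigma * c + r)
    return a * sigma * a + w - correction


def covariance_path(sigma0, a, c, w, r, steps):
    """Return [Sigma_{0|-1}, Sigma_{1|0}, ..., Sigma_{steps|steps-1}]."""
    path = [sigma0]
    for _ in range(steps):
        path.append(riccati_step(path[-1], a, c, w, r))
    return path


# Example with a = c = w = r = 1 and Sigma_{0|-1} = 1.
path = covariance_path(1.0, 1.0, 1.0, 1.0, 1.0, steps=50)
```

With these particular values the fixed point solves $\sigma^2 = \sigma + 1$, so the path settles near the golden ratio.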



[Figure 2: block diagram with blocks Linear System, KF, Encoder, and Estimator.]



Fig. 2: Separation of Estimation and Quantization: When 
the source is Gaussian, generated by the linear system (11),
the cost is quadratic, and the observation channel is Gaussian, 
the separated structure of the encoder above is optimal. That 
is, first the encoder runs a Kalman Filter (KF), and then 
causally encodes its estimate. For one-shot and independent 
observations setups, this result was observed in |5), Q, 0, 
[15], and [18]. Our result shows that an extension of this result
applies for the optimal causal encoding of partially observed 
Markov sources as well. 

We note that the above result also applies to settings in which
a controller acts on the system, that is, there exist a control
$u_t \in \mathbb{R}^m$ and a matrix $B$ such that $x_{t+1} = Ax_t + Bu_t + w_t$. In
this case, the well-known principle of separation of estimation and
control in control theory allows the above results to be applicable. In
particular, the conditional estimation error is not affected by
the control actions.

V. Conclusion 

For the optimal causal coding of a partially observed Markov 
source, the structure of the optimal causal coders is obtained 
and is shown to admit a separation structure. We observed in 
particular that separation of estimation (conditional probabil-
ity computation) and quantization (of this probability) applies,
under such a setup. We also observed that the real-time de- 
centralized coding of a partially observed i.i.d. source admits 
a separation. Such a separation result does not, in general, 
extend to decentralized coding of partially observed Markov 
sources. 

The results and the general program presented in this paper 
apply also to coding over discrete memoryless noisy channels 
with noiseless feedback. We note that Walrand and Varaiya
[46] considered the noisy channel setting in their analysis in
the presence of noiseless feedback.

The separation result will likely find many applications in 
sensor networks and networked control problems where sen- 
sors have imperfect observation of a plant to be controlled. 
One direction is to find explicit results on the optimal poli- 
cies using computational tools. One promising approach is 
the expert-based systems, which are very effective once one 
imposes a structure on the designs, see [26] for details. 

One further problem is on the existence and design of op- 
timal quantizers. Existence of optimal quantizers, even in the 
context of vector quantization for $\mathbb{R}^n$-valued random variables,
requires stringent conditions. Such proofs typically have
the form of Weierstrass theorem: A lower semi-continuous 
function over a compact set admits a minimum. Existence 
results for optimal quantizers for a one-stage cost problem 
has been investigated by Abaya and Wise Q and Pollard 14H 
for continuous cost functions which are non-decreasing in the 
source-reconstruction distance. [56] obtained existence results
for more general cost functions under the restriction that the
code bins/cells are convex and the source admits a density
function. For dynamic quantizers, [58] established the
existence of optimal quantization policies under the assumption
that the quantizers admit the structure suggested by Theorem
2.5 for fully observed Markov sources and the codecells are
convex for a class of Markov sources. Also for dynamic vector
quantizers, ifTTIl investigated the existence results when there 
is a bound on the quantizer bins. 

Theorem 2.5 motivates the problem of optimal quantization
of probability measures. This remains as an interesting
problem to be investigated in a real-time coding context, with 
important practical consequences in control and economics 
applications. With a separation result paving the way for an 
MDP formulation, one could proceed with the analysis of 
ifTTl with the evaluation of optimal quantization policies and 
existence results for infinite horizon problems. Toward this 
direction, Graf and Luschgy, in [22] and [23], have studied
the optimal quantization of probability measures. 

One related question to be pursued further is the following: 
When is there an incentive for signaling in coding problems? 
When the observations are correlated for sources with memory 
or when the real-time coding of possibly independent sources 
over a general MAC type channel is considered, there may be 
an incentive for signaling. Further results on this will shed
some light on some outstanding problems such as the capacity
of MAC channels with feedback [14].

Appendix 

A. Proof of Theorem 2.4

We transform the problem into a real-time coding problem
involving a fully observed Markov source. At time $t = T - 1$,
the per-stage cost function can be written as follows, where
$\gamma^0_t$ denotes a fixed receiver policy:

$$
\begin{aligned}
& E\big[c(x_t, \gamma^0_t(q_{[0,t]})) \,\big|\, q_{[0,t-1]}\big] \\
&= \sum_{k \in M_t} P(q_t = k \mid q_{[0,t-1]}) \Big( \int_X P(dx_t \mid q_{[0,t-1]}, k)\, c(x_t, \gamma^0_t(q_{[0,t-1]}, k)) \Big) \\
&= \int_X \sum_{k \in M_t} P(dx_t, q_t = k \mid q_{[0,t-1]})\, c(x_t, \gamma^0_t(q_{[0,t-1]}, k)) \\
&= \int_{\mathcal{P}(X)} \int_X \sum_{k \in M_t} P(dx_t, q_t = k, d\pi_t \mid q_{[0,t-1]})\, c(x_t, \gamma^0_t(q_{[0,t-1]}, k)) \\
&= \int_{\mathcal{P}(X)} \int_X \sum_{k \in M_t} P(d\pi_t \mid q_{[0,t-1]})\, P(dx_t \mid \pi_t)\, P(q_t = k \mid \pi_t, q_{[0,t-1]})\, c(x_t, \gamma^0_t(q_{[0,t-1]}, k)) \\
&= \int_{\mathcal{P}(X)} \sum_{k \in M_t} P(d\pi_t \mid q_{[0,t-1]})\, P(q_t = k \mid \pi_t, q_{[0,t-1]}) \int_X P(dx_t \mid \pi_t)\, c(x_t, \gamma^0_t(q_{[0,t-1]}, k)) \\
&= E\big[F(\pi_t, q_{[0,t-1]}, q_t) \,\big|\, q_{[0,t-1]}\big], \qquad (13)
\end{aligned}
$$

where $\pi_t(\cdot) = P(x_t \in \cdot \mid y_{[0,t]})$ and

$$F(\pi_t, q_{[0,t-1]}, q_t) = \int_X \pi_t(dx)\, c(x, \gamma^0_t(q_{[0,t-1]}, q_t)).$$

In the above derivation, the fourth equality follows from the
property that

$$x_t \leftrightarrow P(dx_t \mid y_{[0,t]}) \leftrightarrow q_{[0,t]}.$$

We note that $F(\cdot, q_{[0,t-1]}, q_t)$ is measurable on $\mathcal{P}(X)$
under weak convergence topology by Theorem 2.3 and the
fact that the cost is measurable and bounded.

It should be noted that for every composite quantization
policy, one may define $q_t$ as a random variable on the
probability space such that the joint distribution of $(q_t, \pi_t, q_{[0,t-1]})$
matches the characterization that $q_t = Q^{comp}_t(y_{[0,t]}, q_{[0,t-1]})$,
since

$$P(q_t \mid \pi_t, q_{[0,t-1]}) = \int P(q_t \mid y_{[0,t]}, q_{[0,t-1]})\, P(dy_{[0,t]} \mid \pi_t, q_{[0,t-1]}).$$

The final stage cost is thus written as

$$E\big[F(\pi_t, q_{[0,t-1]}, q_t) \,\big|\, q_{[0,t-1]}\big],$$

which is equivalent to, by the smoothing property of
conditional expectation, the following:

$$E\Big[E\big[F(\pi_t, q_{[0,t-1]}, q_t) \,\big|\, \pi_t, q_{[0,t-1]}\big] \,\Big|\, q_{[0,t-1]}\Big].$$

Now, we will apply Witsenhausen's two stage lemma
to show that we can obtain a lower bound for the double
expectation by picking $q_t$ as a result of a measurable function
of $\pi_t, q_{[0,t-1]}$. Thus, we will find a composite quantization
policy which only uses $(\pi_t, q_{[0,t-1]})$ and which performs as well
as one which uses the entire memory available at the encoder.
To make this precise, let us fix the decision function $\gamma^0_t$ at
the receiver corresponding to a given composite quantization
policy at the encoder $Q^{comp}_t$, let $t = T - 1$, and define for
every $k \in M_t$:

$$\beta_k := \Big\{ (\pi_t, q_{[0,t-1]}) : F(\pi_t, q_{[0,t-1]}, k) \leq F(\pi_t, q_{[0,t-1]}, q'),\ \forall q' \neq k,\ q' \in M_t \Big\}.$$

These sets are Borel, by the measurability of $F$ on $\mathcal{P}(X)$. Such
a construction covers the domain set consisting of $(\pi_t, q_{[0,t-1]})$
but with overlaps. It covers the elements in $\mathcal{P}(X) \times \prod_{s=0}^{t-1} M_s$,
since for every element in this product set, there is a
minimizing $k \in M_t$ (note that $M_t$ is finite). To avoid the overlaps,
we adopt the following technique which was introduced in
Witsenhausen [52]. Let there be an ordering of the elements
in $M_t$ as $1, 2, \ldots, |M_t|$, and for $k \geq 1$ in this sequence define
a function $Q^{comp,*}_t$ as:

$$q_t = Q^{comp,*}_t(\pi_t, q_{[0,t-1]}) = k, \quad \text{if } (\pi_t, q_{[0,t-1]}) \in \beta_k \setminus \bigcup_{i=1}^{k-1} \beta_i,$$

with $\beta_0 = \emptyset$. Thus, for any random variable $q_t$ appropriately
defined on the probability space,

$$
\begin{aligned}
& E\Big[E\big[F(\pi_t, q_{[0,t-1]}, q_t) \,\big|\, \pi_t, q_{[0,t-1]}\big] \,\Big|\, q_{[0,t-1]}\Big] \\
&\geq E\Big[E\big[F(\pi_t, q_{[0,t-1]}, Q^{comp,*}_t(\pi_t, q_{[0,t-1]})) \,\big|\, \pi_t, q_{[0,t-1]}\big] \,\Big|\, q_{[0,t-1]}\Big].
\end{aligned}
$$

Thus, the new composite policy performs at least as well as the
original composite coding policy even though it has a restricted
structure.
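The tie-breaking construction above amounts to choosing, for each pair (belief, past outputs), the first index in the chosen ordering that minimizes $F$. The sketch below illustrates this on a finite table. It is a hedged illustration: the table `TABLE`, the weights `PROBS`, and all function names are hypothetical, and the continuous belief space is replaced by two labeled states.

```python
from itertools import product


def select_output(costs):
    """Return the smallest index k (1-based) whose cost is minimal.

    Taking the first minimizer plays the role of removing the earlier
    sets from beta_k, so every state is assigned exactly one output.
    """
    best = min(costs)
    return next(k for k, value in enumerate(costs, start=1) if value == best)


def expected_cost(policy, table, probs):
    """Average cost of a policy mapping each state to an output index."""
    return sum(probs[s] * table[s][policy[s] - 1] for s in table)


def all_policies(table, n_outputs):
    """Enumerate every map from states to output indices 1..n_outputs."""
    for choice in product(range(1, n_outputs + 1), repeat=len(table)):
        yield dict(zip(table, choice))


# Hypothetical values of F for two states and three quantizer outputs.
TABLE = {"s0": [4, 1, 1], "s1": [2, 5, 2]}
PROBS = {"s0": 0.5, "s1": 0.5}
STAR = {s: select_output(row) for s, row in TABLE.items()}
```

In this table both states have ties among the minimizers, and `select_output` resolves each tie toward the lower index, so the resulting map is a well-defined function of the state.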

As such, any composite policy can be replaced with one
which uses only $(\pi_t, q_{[0,t-1]})$ without any loss of performance,
while keeping the receiver decision function $\gamma^0_t$ fixed. It should
now be noted that $\{\pi_t\}$ is a Markov process: Note that the
following filtering equation applies [10]:

$$P(dx_t \mid dy_{[0,t]}) = \frac{\int_{x_{t-1}} P(dy_t \mid x_t)\, P(dx_t \mid x_{t-1})\, P(dx_{t-1} \mid dy_{[0,t-1]})}{\int_{x_{t-1}, x_t} P(dy_t \mid x_t)\, P(dx_t \mid x_{t-1})\, P(dx_{t-1} \mid dy_{[0,t-1]})},$$

and $P(dy_t \mid \pi_s, s \leq t-1) = \int_{x_t} P(dy_t, dx_t \mid \pi_s, s \leq t-1) =
P(dy_t \mid \pi_{t-1})$. These imply that (see [10]; see also the
discussion following (15)) the following defines a Markov kernel
(that is, a regular conditional probability measure):

$$P(d\pi_t \mid \pi_s, s \leq t-1) = P(d\pi_t \mid \pi_{t-1}). \qquad (14)$$
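For a finite state space, the filtering equation above reduces to a prediction step followed by a Bayes correction. The following sketch is illustrative only: it assumes finite $X$ and $Y$, and the argument names `prior`, `transition`, and `likelihood` are hypothetical. It shows that $\pi_t$ is computed from $\pi_{t-1}$ and the new observation alone, which is the reason $\{\pi_t\}$ is Markov.

```python
from fractions import Fraction


def filter_update(prior, transition, likelihood, y):
    """Return pi_t given pi_{t-1} (prior) and the new observation y.

    prior[i]         = P(x_{t-1} = i | y_[0,t-1])
    transition[i][j] = P(x_t = j | x_{t-1} = i)
    likelihood[j][y] = P(y_t = y | x_t = j)
    """
    n = len(prior)
    # Prediction: push the previous belief through the source dynamics.
    predicted = [sum(prior[i] * transition[i][j] for i in range(n)) for j in range(n)]
    # Correction: weight by the observation likelihood and normalize.
    unnormalized = [likelihood[j][y] * predicted[j] for j in range(n)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


# Hypothetical two-state example with exact arithmetic.
prior = [Fraction(1, 2), Fraction(1, 2)]
transition = [[Fraction(9, 10), Fraction(1, 10)], [Fraction(1, 5), Fraction(4, 5)]]
likelihood = [[Fraction(4, 5), Fraction(1, 5)], [Fraction(3, 10), Fraction(7, 10)]]
posterior = filter_update(prior, transition, likelihood, y=0)
```

Here `posterior` is the belief after observing `y = 0`; repeated calls with successive observations trace out one sample path of the belief process.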



We have thus obtained the structure of the encoder for the final
stage. We iteratively proceed to study the other time stages. In
particular, since $\{\pi_t\}$ is Markov, we could proceed as follows
(in essence using Witsenhausen's three-stage lemma [52]): For
a three-stage cost problem, the cost at time $t = 2$ can be
written as, for measurable functions $c_2, c_3$:

$$E\Big[c_2(\pi_2, v_2(q_{[1,2]})) + E\big[c_3(\pi_3, v_3(q_{[1,2]}, Q^{comp,*}_3(\pi_3, q_{[1,2]}))) \,\big|\, q_{[1,2]}, \pi_{[1,2]}\big] \,\Big|\, q_{[1,2]}, \pi_{[1,2]}\Big]$$

Since

$$P(d\pi_3, q_2, q_1 \mid \pi_2, \pi_1, q_2, q_1) = P(d\pi_3, q_2, q_1 \mid \pi_2, q_2, q_1),$$

and since under $Q^{comp,*}_3$, $q_3$ is a function of $\pi_3$ and $q_1, q_2$,
the expectation above is equal to, for some measurable $F_2(\cdot)$,
$E[F_2(\pi_2, q_2, q_1) \mid \pi_2, \pi_1, q_2, q_1]$. Following similar steps as
earlier, it can be established now that a composite quantization
policy at time 2 can use only $\pi_2$ and $q_1$ without any loss.

By a similar argument, an encoder at time $t$, $1 \leq t \leq T - 1$,
only uses $(\pi_t, q_{[0,t-1]})$. The encoder at time $t = 0$ uses $\pi_0$,
where $\pi_0 = \nu_0$ is the prior distribution on the initial state.

Now that we have obtained the structure for a composite
policy, we can express this as:

$$Q^{comp}_t(\pi_t, q_{[0,t-1]}) = Q^{q_{[0,t-1]}}(\pi_t), \quad \forall \pi_t, q_{[0,t-1]},$$

such that the quantizer action $Q^{q_{[0,t-1]}} \in \mathcal{Q}(\mathcal{P}(X); M_t)$ is
generated using only $q_{[0,t-1]}$, and the quantizer outcome is
generated by evaluating $Q^{q_{[0,t-1]}}(\pi_t)$ for every $\pi_t$. ⋄



B. Proof of Theorem 2.5

At time $t = T - 1$, the per-stage cost function can be written
as:

$$E\big[c(x_t, \gamma_t(q_{[0,t]})) \,\big|\, q_{[0,t]}\big] = \int_X P(dx_t \mid q_{[0,t-1]}, q_t)\, c(x_t, \gamma_t(q_{[0,t]})).$$

Thus, at time $t = T - 1$, an optimal receiver (which is
deterministic without any loss of optimality, see [9]) will use
$P(dx_t \mid q_{[0,t]})$ as a sufficient statistic for an optimal decision (or
any receiver can be replaced with one which uses this sufficient
statistic without any loss). Let us fix such a receiver policy
which only uses the posterior $P(dx_t \mid q_{[0,t]})$ as its sufficient
statistic. Note that

$$P(dx_t \mid q_{[0,t]}) = \int_{\pi_t} P(dx_t \mid \pi_t)\, P(d\pi_t \mid q_{[0,t]})$$

and further

$$P(d\pi_t \mid q_{[0,t]}) = \frac{P(q_t, d\pi_t \mid q_{[0,t-1]})}{\int_{\pi_t} P(q_t, d\pi_t \mid q_{[0,t-1]})} = \frac{P(q_t \mid \pi_t, q_{[0,t-1]})\, P(d\pi_t \mid q_{[0,t-1]})}{\int_{\pi_t} P(q_t \mid \pi_t, q_{[0,t-1]})\, P(d\pi_t \mid q_{[0,t-1]})}. \qquad (15)$$

The term $P(q_t \mid \pi_t, q_{[0,t-1]})$ is determined by the quantizer
action $Q_t$ (this follows from Theorem 2.4). Furthermore, given
$Q_t$, the relation (15) is measurable on $\mathcal{P}(\mathcal{P}(X))$ (that is, in
$\Xi_t(\cdot) = P(\pi_t \in \cdot \mid q_{[0,t-1]})$) under weak convergence.
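When the measure on beliefs has finite support and the quantizer is deterministic, the update in (15) is a restriction to the quantization bin followed by a renormalization. The sketch below is a hedged illustration under exactly those assumptions; the dictionary `xi`, the quantizer `threshold_quantizer`, and the function name are hypothetical.

```python
from fractions import Fraction


def update_belief_on_beliefs(xi, quantizer, q):
    """Posterior over candidate beliefs pi after observing q = quantizer(pi).

    xi maps each candidate belief (a tuple of probabilities) to its weight,
    mirroring P(d pi_t | q_[0,t-1]); the result mirrors P(d pi_t | q_[0,t]).
    For a deterministic quantizer, P(q | pi, past) is an indicator function.
    """
    weights = {pi: w for pi, w in xi.items() if quantizer(pi) == q}
    total = sum(weights.values())
    if total == 0:
        raise ValueError("output q has zero probability under xi")
    return {pi: w / total for pi, w in weights.items()}


def threshold_quantizer(pi):
    """Illustrative one-bit quantizer of a belief on a two-point space."""
    return 1 if pi[0] > Fraction(1, 2) else 0


# Hypothetical measure on three candidate beliefs.
xi = {
    (Fraction(1, 4), Fraction(3, 4)): Fraction(1, 2),
    (Fraction(3, 4), Fraction(1, 4)): Fraction(1, 4),
    (Fraction(9, 10), Fraction(1, 10)): Fraction(1, 4),
}
posterior = update_belief_on_beliefs(xi, threshold_quantizer, q=1)
```

The receiver's posterior on $x_t$ then follows by averaging the surviving beliefs with the weights in `posterior`, in line with the first display of this proof.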

To prove this technical argument, consider the numerator in
(15) and note that the function $\kappa_B : \mathcal{P}(\mathcal{P}(X)) \to \mathbb{R}$ defined
as $\kappa_B(\Xi) = \Xi(B)$ is measurable under weak convergence
topology as a consequence of Theorem 2.3 for each $B \in
\mathcal{B}(\mathcal{P}(X))$. By Theorem 2.1 in Dubins and Freedman [16], this
implies that the relation in (15) is measurable on $\mathcal{P}(\mathcal{P}(X))$
(since the topology considered in [16] is not stronger than the
weak convergence topology, the result in [16] holds in this
case as well).

Let us denote the quantizer applied, given the past
realizations of quantizer outputs, as $Q_t^{q_{[0,t-1]}}$. Note that $q_t$ is
deterministically determined by $(\pi_t, Q_t^{q_{[0,t-1]}})$ and the optimal
receiver function can be expressed as $\gamma^0(\Xi_t, q_t)$ (as a
measurable function), given $Q_t^{q_{[0,t-1]}}$. The cost at time $t = T - 1$
can be expressed, given the quantizer $Q_t^{q_{[0,t-1]}}$, for some Borel
function $G$, as $G(\Xi_t, Q_t^{q_{[0,t-1]}})$, where

$$G(\Xi_t, Q_t^{q_{[0,t-1]}}) = \int_{\mathcal{P}(X)} \Xi_t(d\pi_t) \sum_{q_t \in M_t} 1_{\{q_t = Q_t^{q_{[0,t-1]}}(\pi_t)\}}\, \eta^{Q_t^{q_{[0,t-1]}}}(\Xi_t, q_t)$$

with

$$\eta^{Q_t^{q_{[0,t-1]}}}(\Xi_t, q_t) = \int_X \pi_t(dx_t)\, c(x_t, \gamma^0(\Xi_t, q_t)). \qquad (16)$$

Now, one can construct an equivalence class among the past
$q_{[0,t-1]}$ sequences which induce the same $\Xi_t$, and can replace
the quantizers $Q_t^{q_{[0,t-1]}}$ for each class with one which induces
a lower cost among the finitely many elements in each such
class, for the final time stage. Thus, an optimal quantization
output may be generated using $\Xi_t(\cdot) = P(\pi_t \in \cdot \mid q_{[0,t-1]})$
and $\pi_t$. Since there are only finitely many past sequences and
finitely many $\Xi_t$, this leads to a Borel measurable selection
of $\pi_t$ for every $\Xi_t$, leading to a quantizer and a measurable
selection in $\Xi_t, \pi_t$.

Since such a selection for $Q_t$ only uses $\Xi_t$, an optimal
quantization output may be generated using $\Xi_t(\cdot) = P(\pi_t \in
\cdot \mid q_{[0,t-1]})$ and $\pi_t$. Hence, $G(\Xi_t, Q_t^{q_{[0,t-1]}})$ can be replaced with
$F_t(\Xi_t)$ for some $F_t$, without any performance loss.

The same argument applies for all time stages: At time
$t = T - 2$, the sufficient statistic both for the immediate
cost and the cost-to-go is $P(dx_{t-1} \mid q_{[0,t-1]})$, and thus for
the cost impacting the time stage $t = T - 1$, as a result of
the optimality result for $Q_{T-1}$. To show that the separation
result generalizes to all time stages, it suffices to prove that
$\{(\Xi_t, Q_t)\}$ is a controlled Markov chain, if the encoders use
the structure above.

Toward this end, we establish that for $t \geq 1$, for all $B \in
\mathcal{B}(\mathcal{P}(\mathcal{P}(X)))$, equation (18) below holds:

$$
\begin{aligned}
& P\Big( P(d\pi_t \mid q_{[0,t-1]}) \in B \,\Big|\, P(d\pi_s \mid q_{[0,s-1]}), Q_s, s \leq t-1 \Big) \\
&= P\Big( \int_{\pi_{t-1}} P(d\pi_t, d\pi_{t-1} \mid q_{[0,t-1]}) \in B \,\Big|\, P(d\pi_s \mid q_{[0,s-1]}), Q_s, s \leq t-1 \Big) \\
&= P\Big( \frac{\int_{\pi_{t-1}} P(d\pi_t \mid \pi_{t-1})\, P(q_{t-1} \mid \pi_{t-1}, q_{[0,t-2]})\, P(d\pi_{t-1} \mid q_{[0,t-2]})}{\int_{\pi_{t-1}, \pi_t} P(d\pi_t \mid \pi_{t-1})\, P(q_{t-1} \mid \pi_{t-1}, q_{[0,t-2]})\, P(d\pi_{t-1} \mid q_{[0,t-2]})} \in B \,\Big|\, P(d\pi_s \mid q_{[0,s-1]}), Q_s, s \leq t-1 \Big) \\
&= P\Big( \frac{\int_{\pi_{t-1}} P(d\pi_t \mid \pi_{t-1})\, P(q_{t-1} \mid \pi_{t-1}, q_{[0,t-2]})\, P(d\pi_{t-1} \mid q_{[0,t-2]})}{\int_{\pi_{t-1}, \pi_t} P(d\pi_t \mid \pi_{t-1})\, P(q_{t-1} \mid \pi_{t-1}, q_{[0,t-2]})\, P(d\pi_{t-1} \mid q_{[0,t-2]})} \in B \,\Big|\, P(d\pi_{t-1} \mid q_{[0,t-2]}), Q_{t-1} \Big) \qquad (17) \\
&= P\Big( \int_{\pi_{t-1}} P(d\pi_t, d\pi_{t-1} \mid q_{[0,t-1]}) \in B \,\Big|\, P(d\pi_{t-1} \mid q_{[0,t-2]}), Q_{t-1} \Big) \\
&= P\Big( P(d\pi_t \mid q_{[0,t-1]}) \in B \,\Big|\, P(d\pi_{t-1} \mid q_{[0,t-2]}), Q_{t-1} \Big). \qquad (18)
\end{aligned}
$$

Here, (17) uses the fact that $P(q_{t-1} \mid \pi_{t-1}, q_{[0,t-2]})$ is
identified by $\{P(d\pi_{t-1} \mid q_{[0,t-2]}), Q_{t-1}\}$, which in turn is uniquely
identified by $q_{[0,t-2]}$ and $Q_{t-1}$. Furthermore, the relation in
(18) defines a regular conditional probability measure since
for all $B \in \mathcal{B}(\mathcal{P}(X))$,

$$
\begin{aligned}
\Xi_t(B) &= P(\pi_t \in B \mid q_{[0,t-1]}) \\
&= \int_{\pi_{t-1}} P(\pi_t \in B, d\pi_{t-1} \mid q_{[0,t-1]}) \\
&= \int_{\pi_{t-1}} P(\pi_t \in B \mid \pi_{t-1})\, P(d\pi_{t-1} \mid q_{[0,t-1]})
\end{aligned}
$$

is measurable in $\Xi_{t-1}$, given $Q_{t-1}$ (as a consequence of the
measurability of (15) in $\Xi_t$). Hence, by the result of Dubins
and Freedman mentioned above (Theorem 2.1 in [16]), we
conclude that for any measurable function $F_t$ of $\Xi_t$,

$$E\big[F_t(\Xi_t) \,\big|\, \Xi_{[0,t-1]}, Q_{[0,t-1]}\big] = E\big[F_t(\Xi_t) \,\big|\, \Xi_{t-1}, Q_{t-1}\big],$$

for every given $Q_{t-1}$. Now, once again an equivalence
relationship between the finitely many past quantizer outputs,
based on the equivalence of the conditional measures $\Xi_t$
they induce, can be constructed. With the controlled Markov
structure, we can follow the same argument for earlier time
stages. Therefore, it suffices that the encoder uses only $(\Xi_t, t)$
as its sufficient statistic for all time stages, to generate the
optimal quantizer. An optimal quantizer uses $\pi_t$ to generate
the optimal quantization outputs.

C. Proof of Theorem 3.1

The proof is in three steps, (i), (ii) and (iii) below.

Step (i) In decentralized dynamic decision problems where the
decision makers have the same objective (that is, in team
problems), more information provided to the decision makers does
not lead to any degradation in performance, since the decision
makers can always choose to ignore the additional information.
In view of this, let us relax the information structure in such
a way that the decision makers now have access to all the
previous observations, that is, the information available at
encoders 1 and 2 is:

$$I^i_t = \{y^i_t, \mathbf{y}_{[0,t-1]}, \mathbf{q}_{[0,t-1]}\}, \quad t \geq 1, \quad i = 1, 2,$$
$$I^i_0 = \{y^i_0\}, \quad i = 1, 2.$$

The information pattern among the encoders is now the one-
step delayed observation sharing pattern. We will show that
the past information can be eliminated altogether, to prove the
desired result.

Step (ii) The second step uses the following technical Lemma.

Lemma A.1: Under the relaxed information structure in
step (i) above, any decentralized quantization policy at time
$t$, $1 \leq t \leq T - 1$, can be replaced, without any loss in
performance, with one which only uses $(\pi, \mathbf{y}_t, \mathbf{q}_{[0,t-1]})$, satisfying
the following form:

$$
\begin{aligned}
P(\mathbf{q}_t \mid \mathbf{y}_{[0,t]}, \mathbf{q}_{[0,t-1]})
&= P(q^1_t \mid y^1_t, \mathbf{q}_{[0,t-1]})\, P(q^2_t \mid y^2_t, \mathbf{q}_{[0,t-1]}) \\
&= 1_{\{q^1_t = Q^1(y^1_t, \mathbf{q}_{[0,t-1]})\}}\, 1_{\{q^2_t = Q^2(y^2_t, \mathbf{q}_{[0,t-1]})\}}, \qquad (19)
\end{aligned}
$$

for measurable functions $Q^1$ and $Q^2$. ⋄

Proof. Let us fix a composite quantization policy $\Pi^{comp}$. At
time $t = T - 1$, the per-stage cost function can be written as:

$$E\Big[\int_X P(dx_t \mid \mathbf{q}_{[0,t]})\, c(x_t, v_t) \,\Big|\, \mathbf{q}_{[0,t-1]}\Big] \qquad (20)$$

For this problem, $P(dx_t \mid \mathbf{q}_{[0,t]})$ is a sufficient statistic for an
optimal receiver. Hence, at time $t = T - 1$, an optimal receiver
will use $P(dx_t \mid \mathbf{q}_{[0,t]})$ as a sufficient statistic for an optimal
decision, as the cost function conditioned on $\mathbf{q}_{[0,t]}$ is written
as $\int P(dx_t \mid \mathbf{q}_{[0,t]})\, c(x_t, v_t)$, where $v_t$ is the decision of the
receiver. Now, let us fix this decision policy at time $t$. We
now note that (21), displayed below, follows.

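In (21), we use the relation $P(dx_t \mid \mathbf{y}_{[0,t-1]}) = P(dx_t) =:
\pi(dx_t)$, where $\pi(\cdot)$ denotes the marginal probability on $x_t$
(recall that the source is memoryless). The term
$P(\mathbf{q}_t \mid \mathbf{y}_{[0,t]}, \mathbf{q}_{[0,t-1]})$
in (21) is determined by the composite quantization policies:

$$
\begin{aligned}
P(\mathbf{q}_t \mid \mathbf{y}_{[0,t]}, \mathbf{q}_{[0,t-1]})
&= P(q^1_t \mid y^1_t, \mathbf{y}_{[0,t-1]}, \mathbf{q}_{[0,t-1]})\, P(q^2_t \mid y^2_t, \mathbf{y}_{[0,t-1]}, \mathbf{q}_{[0,t-1]}) \\
&= 1_{\{q^1_t = Q^{comp,1}_t(y^1_t, \mathbf{y}_{[0,t-1]}, \mathbf{q}_{[0,t-1]})\}} \times 1_{\{q^2_t = Q^{comp,2}_t(y^2_t, \mathbf{y}_{[0,t-1]}, \mathbf{q}_{[0,t-1]})\}}
\end{aligned}
$$

The above is valid since each encoder knows the past
observations of both encoders. As such, $P(dx_t \mid \mathbf{q}_{[0,t]})$ can be written
as, for some function $T$: $T(\pi, \mathbf{q}_{[0,t-1]}, Q^{comp}_t(\cdot))$. Note that
$\mathbf{q}_{[0,t-1]}$ appears due to the term $P(\mathbf{y}_{[0,t-1]} \mid \mathbf{q}_{[0,t-1]})$. Now,
consider the following space of joint (team) mappings at time
$t$, denoted by $\mathcal{G}_t$:

$$\mathcal{G}_t = \{\Psi_t : \Psi_t = \{\Psi^1_t, \Psi^2_t\},\ \Psi^i_t : Y^i \to M^i_t,\ i = 1, 2\}. \qquad (22)$$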

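For every composite quantization policy there exists a
distribution $P'$ on random variables $(\mathbf{q}_t, \pi, \mathbf{q}_{[0,t-1]})$ such that

$$
\begin{aligned}
P'(\mathbf{q}_t \mid \pi, \mathbf{q}_{[0,t-1]})
&= \sum_{(Y^1 \times Y^2)^{t+1}} P(\mathbf{q}_t, \mathbf{y}_{[0,t]} \mid \pi, \mathbf{q}_{[0,t-1]}) \\
&= \sum_{(Y^1 \times Y^2)^{t+1}} \Big( P(q^1_t \mid \mathbf{y}_{[0,t-1]}, y^1_t, \mathbf{q}_{[0,t-1]}, \pi) \\
&\qquad \times P(q^2_t \mid \mathbf{y}_{[0,t-1]}, y^2_t, \mathbf{q}_{[0,t-1]}, \pi) \\
&\qquad \times P(y^1_t, y^2_t)\, P(\mathbf{y}_{[0,t-1]} \mid \pi, \mathbf{q}_{[0,t-1]}) \Big) \qquad (23)
\end{aligned}
$$

The relation (21) referred to above is the following:

$$
\begin{aligned}
P(dx_t \mid \mathbf{q}_{[0,t]})
&= \sum_{\mathbf{Y}^{t+1}} P(dx_t, \mathbf{y}_{[0,t]} \mid \mathbf{q}_{[0,t]})
= \frac{\sum_{\mathbf{Y}^{t+1}} P(dx_t, \mathbf{q}_t, \mathbf{y}_{[0,t]} \mid \mathbf{q}_{[0,t-1]})}{P(\mathbf{q}_t \mid \mathbf{q}_{[0,t-1]})} \\
&= \frac{\sum_{\mathbf{Y}^{t+1}} P(\mathbf{q}_t \mid \mathbf{y}_{[0,t]}, \mathbf{q}_{[0,t-1]})\, P(\mathbf{y}_t \mid x_t)\, P(dx_t \mid \mathbf{y}_{[0,t-1]})\, P(\mathbf{y}_{[0,t-1]} \mid \mathbf{q}_{[0,t-1]})}{\int_{X, \mathbf{Y}^{t+1}} P(\mathbf{q}_t \mid \mathbf{y}_{[0,t]}, \mathbf{q}_{[0,t-1]})\, P(\mathbf{y}_t \mid x_t)\, P(dx_t \mid \mathbf{y}_{[0,t-1]})\, P(\mathbf{y}_{[0,t-1]} \mid \mathbf{q}_{[0,t-1]})} \\
&= \frac{\sum_{\mathbf{Y}^{t+1}} P(\mathbf{q}_t \mid \mathbf{y}_{[0,t]}, \mathbf{q}_{[0,t-1]})\, P(\mathbf{y}_t \mid x_t)\, \pi(dx_t)\, P(\mathbf{y}_{[0,t-1]} \mid \mathbf{q}_{[0,t-1]})}{\int_{X, \mathbf{Y}^{t+1}} P(\mathbf{q}_t \mid \mathbf{y}_{[0,t]}, \mathbf{q}_{[0,t-1]})\, P(\mathbf{y}_t \mid x_t)\, \pi(dx_t)\, P(\mathbf{y}_{[0,t-1]} \mid \mathbf{q}_{[0,t-1]})} \qquad (21)
\end{aligned}
$$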



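Furthermore, with every composite quantization policy and
every realization of $\mathbf{y}_{[0,t-1]}, \mathbf{q}_{[0,t-1]}$, we can associate an
element in the space $\mathcal{G}_t$, $\Psi_{\mathbf{y}_{[0,t-1]}, \mathbf{q}_{[0,t-1]}}$, such that the induced
stochastic relationship in (23) can be obtained:

$$
\begin{aligned}
P'(\mathbf{q}_t \mid \pi, \mathbf{q}_{[0,t-1]})
&= \sum_{(Y^1 \times Y^2)^{t+1}} P(\mathbf{q}_t, \mathbf{y}_{[0,t]} \mid \pi, \mathbf{q}_{[0,t-1]}) \\
&= \sum_{(Y^1 \times Y^2)^{t+1}} 1_{\{\Psi_{\mathbf{y}_{[0,t-1]}, \mathbf{q}_{[0,t-1]}}(y^1_t, y^2_t) = (q^1_t, q^2_t)\}} \times P(y^1_t, y^2_t)\, P(\mathbf{y}_{[0,t-1]} \mid \pi, \mathbf{q}_{[0,t-1]})
\end{aligned}
$$

We can thus express the cost, for some measurable function
$F$, in the following way:

$$E\big[F(\pi, \mathbf{q}_{[0,t-1]}, \Psi) \,\big|\, \pi, \mathbf{q}_{[0,t-1]}\big],$$

where

$$P(\Psi \mid \pi, \mathbf{q}_{[0,t-1]}) = \sum_{(Y^1 \times Y^2)^{t}} 1_{\{\Psi = \Psi_{\mathbf{y}_{[0,t-1]}, \mathbf{q}_{[0,t-1]}}\}}\, P(\mathbf{y}_{[0,t-1]} \mid \pi, \mathbf{q}_{[0,t-1]}).$$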

Now let $t = T - 1$ and define for every realization $\Psi_t =
(\Psi^1_t, \Psi^2_t) \in \mathcal{G}_t$ (with the decision policy considered earlier
fixed):

$$\beta_{\Psi_t} := \Big\{ (\pi, \mathbf{q}_{[0,t-1]}) : F(\pi, \mathbf{q}_{[0,t-1]}, \Psi_t) \leq F(\pi, \mathbf{q}_{[0,t-1]}, \Psi'_t),\ \forall \Psi'_t \in \mathcal{G}_t \Big\}.$$

As we had observed in the proof of Theorem 2.4, such a
construction covers the domain set consisting of $(\pi, \mathbf{q}_{[0,t-1]})$
but possibly with overlaps. Note that for every $(\pi, \mathbf{q}_{[0,t-1]})$,
there exists a minimizing function in $\mathcal{G}_t$, since $\mathcal{G}_t$ is a finite
set. In this sequence, let there be an ordering of the finitely
many elements in $\mathcal{G}_t$ as $\{\Psi_t(1), \Psi_t(2), \ldots, \Psi_t(k), \ldots\}$, and
define a function $T^*_t$ as:

$$\Psi_t(k) = T^*_t(\pi, \mathbf{q}_{[0,t-1]}), \quad \text{if } (\pi, \mathbf{q}_{[0,t-1]}) \in \beta_{\Psi_t(k)} \setminus \bigcup_{i=1}^{k-1} \beta_{\Psi_t(i)},$$

with $\beta_{\Psi_t(0)} = \emptyset$. Thus, we have constructed a policy which
performs at least as well as the original composite
quantization policy. It has a restricted structure in that it only uses
$(\pi, \mathbf{q}_{[0,t-1]})$ to generate the team action and the local
information $y^1_t, y^2_t$ to generate the quantizer outputs.

Now that we have obtained the structure of the optimal
encoders for the last stage, we can sequentially proceed to
study the other time stages. Note that given a fixed $\pi$, $\{(\pi, \mathbf{y}_t)\}$
is i.i.d. and hence Markov. Now, define $\pi'_t = (\pi, \mathbf{y}_t)$. For a
three-stage cost problem, the cost at time $t = 2$ can be written
as, for measurable functions $c_2, c_3$:

$$c_2(\pi'_2, v_2(\mathbf{q}_{[1,2]})) + E\big[c_3(\pi'_3, v_3(\mathbf{q}_{[1,2]}, Q_3(\pi'_3, \mathbf{q}_{[1,2]}))) \,\big|\, \pi'_{[1,2]}, \mathbf{q}_{[1,2]}\big]$$

Since $P(d\pi'_3, \mathbf{q}_{[1,2]} \mid \pi'_2, \pi'_1, \mathbf{q}_{[1,2]}) = P(d\pi'_3, \mathbf{q}_{[1,2]} \mid \pi'_2, \mathbf{q}_{[1,2]})$,
the expression above is equal to $F_2(\pi'_2, \mathbf{q}_2, \mathbf{q}_1)$ for some
measurable $F_2$. By a similar argument, an optimal composite
quantizer at time $t$, $1 \leq t \leq T - 1$, only uses $(\pi, \mathbf{y}_t, \mathbf{q}_{[0,t-1]})$.
An optimal (team) policy generates the quantizers $Q^1_t, Q^2_t$
using $\mathbf{q}_{[0,t-1]}$ and $\pi$, and the quantizers use $\{y^i_t\}$ to generate the
quantizer outputs at time $t$ for $i = 1, 2$. ⋄

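Step (iii) The final step will complete the proof. At time $t =
T - 1$, an optimal receiver will use $P(dx_t \mid \mathbf{q}_{[0,t]})$ as a sufficient
statistic for the optimal decision. We now observe that

$$
\begin{aligned}
P(dx_t \mid \mathbf{q}_{[0,t]})
&= \sum_{\mathbf{Y}^{t+1}} P(dx_t \mid \mathbf{y}_{[0,t]})\, P(\mathbf{y}_{[0,t]} \mid \mathbf{q}_{[0,t]}) \qquad (24) \\
&= \sum_{\mathbf{Y}^{t+1}} P(dx_t \mid \mathbf{y}_t)\, P(\mathbf{y}_{[0,t]} \mid \mathbf{q}_{[0,t]}) \\
&= \sum_{\mathbf{Y}} P(dx_t \mid \mathbf{y}_t) \sum_{\mathbf{Y}^{t}} P(\mathbf{y}_{[0,t]} \mid \mathbf{q}_{[0,t]}) \\
&= \sum_{\mathbf{Y}} P(dx_t \mid \mathbf{y}_t)\, P(\mathbf{y}_t \mid \mathbf{q}_{[0,t]})
\end{aligned}
$$

Thus, $P(dx_t \mid \mathbf{q}_{[0,t]})$ is a function of $P(\mathbf{y}_t \mid \mathbf{q}_{[0,t]})$. Now, let us
note that

$$
\begin{aligned}
P(\mathbf{y}_t \mid \mathbf{q}_{[0,t]})
&= \frac{P(\mathbf{q}_t, \mathbf{y}_t \mid \mathbf{q}_{[0,t-1]})}{\sum_{\mathbf{y}_t} P(\mathbf{q}_t, \mathbf{y}_t \mid \mathbf{q}_{[0,t-1]})} \\
&= \frac{P(\mathbf{q}_t \mid \mathbf{y}_t, \mathbf{q}_{[0,t-1]})\, P(\mathbf{y}_t \mid \mathbf{q}_{[0,t-1]})}{\sum_{\mathbf{y}_t} P(\mathbf{q}_t \mid \mathbf{y}_t, \mathbf{q}_{[0,t-1]})\, P(\mathbf{y}_t \mid \mathbf{q}_{[0,t-1]})} \\
&= \frac{P(\mathbf{q}_t \mid \mathbf{y}_t, \mathbf{q}_{[0,t-1]})\, P(\mathbf{y}_t)}{\sum_{\mathbf{y}_t} P(\mathbf{q}_t \mid \mathbf{y}_t, \mathbf{q}_{[0,t-1]})\, P(\mathbf{y}_t)}, \qquad (25)
\end{aligned}
$$

where the term $P(\mathbf{q}_t \mid \mathbf{y}_t, \mathbf{q}_{[0,t-1]})$ is determined by the
quantizer team action $Q^{comp}_t$. As such, the cost at time $t = T - 1$
can be expressed as a measurable function $G(P(\mathbf{y}_t), Q_t)$.
Thus, it follows that an optimal quantizer policy at the last
stage, $t = T - 1$, may only use $P(\mathbf{y}_t)$ to generate the quantizers,
where the quantizers use the local information $y^i_t$ to generate
the quantization output. At time $t = T - 2$, the sufficient
statistic for the cost function is $P(dx_{t-1} \mid \mathbf{q}_{[0,t-1]})$ both for the
immediate cost and the cost-to-go, that is, the cost impacting
the time stage $t = T - 1$, as a result of the optimality result
for $Q_{T-1}$ and the memoryless nature of the source dynamics.
The same argument applies for all time stages.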

Hence, any policy without loss can be replaced with one in 
n WSA ' defined in (0. Since there are finitely many policies 
in this class, an optimal policy exists. 